Chapter 3 Accessing Data
Data can be delivered in many ways, which makes it necessary to have a multitude of ways to harvest it.
3.1 Flat files
Flat files, such as comma or tab separated values (csv, tsv), or files with values in fixed column widths, can be read simply once the format is determined. Flat files are also very common for very large data collections, for example in Hadoop, where the data is partitioned into many small chunks overlaid with a system to access them.
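As a quick sketch of how these layouts are read in R with the readr package (the inline strings below are invented examples, not real data):

```r
library(readr)

# Each flat file layout has a matching reader; each one is applied here
# to a small inline string so the example is self-contained.
csv_data <- read_csv("x,y\n1,a\n2,b")             # comma separated
tsv_data <- read_tsv("x\ty\n1\ta\n2\tb")          # tab separated
fwf_data <- read_fwf("1 a\n2 b",                  # fixed column widths
                     fwf_widths(c(2, 1), col_names = c("x", "y")))
```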
3.1.1 Example: tuberculosis
The World Health Organisation (WHO) distributes data on tuberculosis case notifications through its web site at https://www.who.int/tb/country/data/download/en/. Download the latest case notifications data set, which is provided in csv format. A data dictionary is also provided, explaining the format of the file and the information recorded. The data can be read into R using the code below:
library(tidyverse)
library(DT)
tb <- read_csv(here::here("data", "TB_notifications_2020-01-04.csv"))
datatable(tb %>% top_n(10))
This data has 8286 rows and 164 columns. It’s not in tidy form yet, so you will see it again in the tidying material, where you will learn how to wrangle a messy data set. The read_csv function provides a tibble data object in R. An alternative is the base function read.csv, which provides a data.frame object.
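To see the difference between the two readers, compare the classes of the objects they return, using the small example file shipped with readr:

```r
library(readr)
# read_csv returns a tibble (which is also a data.frame) ...
tb_tbl <- read_csv(readr_example("mtcars.csv"))
class(tb_tbl)
# ... while base read.csv returns a plain data.frame
tb_df <- read.csv(readr_example("mtcars.csv"))
class(tb_df)
```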
3.1.2 Your turn: atmospheric carbon dioxide
Atmospheric carbon dioxide recordings are delivered by the Scripps CO\(_2\) program, and available at https://scrippsco2.ucsd.edu/data/atmospheric_co2/sampling_stations.html.
Choose a station and download or directly read the csv file. You will need to skip over a number of lines at the top of the file in order to read only the data.
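The skip argument to read_csv handles those header lines. A minimal sketch of the mechanics, using a small inline stand-in rather than a real station file (the number of lines to skip will differ for your download):

```r
library(readr)
# A tiny stand-in for a station file: two comment lines, then the data.
txt <- "comment line 1\ncomment line 2\nDate,CO2\n1958-03,315.71"
# skip tells read_csv how many lines to jump over before the header.
co2 <- read_csv(txt, skip = 2)
```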
3.1.3 Example: standardised test scores for reading, science and math
Every three years the OECD’s Programme for International Student Assessment measures the ability of 15 year old students across the globe in reading, science and mathematics skills. This data is made available at https://www.oecd.org/pisa/. The format that the data is distributed in differs from survey to survey: the earliest years are text files, and later years are in proprietary formats. The text files have a fixed width format, where each variable occupies a fixed set of columns in the text.
Let’s take a look at the 2006 student data. Download the file from the OECD web site – you should find that it is named INT_Stu06_Dec07.txt, or work with the subset of the first 2000 rows that is provided here.
# This is a previously generated data object of the column widths of variables
load(here::here("data", "PISA_student_2006_varwidths.rda"))
# To work with the full data set, use the downloaded file, INT_Stu06_Dec07.txt,
# instead of INT_Stu06_Dec07_2k.txt
d <- read_fwf(file=here::here("data", "INT_Stu06_Dec07_2k.txt"),
col_positions=fwf_widths(var_widths$widths,
col_names=var_widths$names))
datatable(d %>% top_n(10))
The learningtower package available at https://github.com/ropenscilabs/learningtower contains the data for multiple years and code to process the source files.
3.1.4 Your turn: land surface weather station data
The Global Historical Climate Network delivers data from land surface stations across the globe. Records for individual stations are provided using a fixed width format flat file. Find a station near you, and download the file. The list of stations can be found here. Your station file is identified by the station id. For example, the Melbourne airport station ID is ASN00086282, and thus the data file can be downloaded directly from ftp://ftp.ncdc.noaa.gov/pub/data/ghcn/daily/all/ASN00086282.dly.
Documentation on the file format and how to cite the archive is available at https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt. The column widths of the variables can be determined from this information.
- Work out the column widths for each variable
- Write the code to read in this fixed width format
The data is far from tidy, and needs considerable work to get it into a suitable format for analysis.
A friendlier and tidier csv formatted version of the data has been made available recently, which can be accessed through the https server at the same location.
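One possible sketch of the fixed width read, based on the layout described in the readme: station id occupies 11 characters, year 4, month 2, element 4, then 31 repeats of a 5-character value and three 1-character flags. Verify these widths against the current readme before relying on them; the file path below is an assumption.

```r
library(readr)
# Column widths from the GHCN-Daily readme:
# ID (11), YEAR (4), MONTH (2), ELEMENT (4),
# then 31 repeats of VALUE (5), MFLAG (1), QFLAG (1), SFLAG (1).
ghcn_widths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31))
ghcn_names <- c("id", "year", "month", "element",
  paste0(rep(c("value", "mflag", "qflag", "sflag"), times = 31),
         rep(1:31, each = 4)))
dly_file <- here::here("data", "ASN00086282.dly")
if (file.exists(dly_file)) {
  ghcn <- read_fwf(dly_file, fwf_widths(ghcn_widths, col_names = ghcn_names))
}
```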
3.2 Proprietary formats
Data is often delivered in a specialist format developed by a company or individual, with its own encoding scheme. Examples include Excel, SAS or SPSS files. These require decoding in order to read.
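In R, each of these families of formats has a dedicated reader package:

```r
library(readxl)  # Excel workbooks: read_xls(), read_xlsx()
library(haven)   # SAS: read_sas(); SPSS: read_sav(); Stata: read_dta()
```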
3.2.1 Example: Australia air traffic data
The Australian Department of Infrastructure, Transport, Cities and Regional Development provides Excel format files summarising air traffic at airports. Download the most recent data, and open the Excel file. There are multiple sheets. The third sheet has passenger numbers. There are 6 rows of meta-information, which need to be skipped when reading the data.
library(readxl)
passengers <- read_xlsx(here::here("data", "WebAirport_FY_1986-2019.xlsx"), sheet=3, skip=6)
datatable(passengers)
3.2.2 Your turn: Munging spreadsheets
The book Spreadsheet Munging Strategies by Duncan Garmonsway is a really good resource for dealing with complicated Excel spreadsheets. We recommend working through the examples in this book to familiarise yourself with ways to deal with messy spreadsheets, and to incorporate information such as special formatting into the data.
The case study developed from Bob Rudis’s post on CDC vaccination data is especially recommended.
3.2.3 Example: SPSS binary
The 2015 and 2018 PISA surveys are available in binary file format, which is a more complete data format in the sense that variable typing and level descriptions are integrated with the data. The R package haven can be used to read foreign binary formats. Download the 2018 data from https://www.oecd.org/pisa/data/2018database/ (494Mb), and use the following code to read it into R. There is a data dictionary on the same site which describes the fields in detail. This is really useful for deciding which variables to keep.
library(haven)
stu_qqq <- read_sav(here::here("data", "SPSS_STU_QQQ.zip"))
PISA_au_2018 <- stu_qqq %>%
filter(CNT == "AUS") %>%
select(CNT, ST004D01T, ST012Q01TA, ST013Q01TA, PV1MATH:PV10SCIE)
# Keep country, gender, num TVs, num books,
saveRDS(PISA_au_2018, file=here::here("data", "PISA_au_2018.rds"))
3.3 Reading multiple files from a website
Often multiple datasets are available on a web site. It is tedious and inefficient to manually download each file and read it into R. An example is data on the rental market in Tasmania from data.gov.au. The sawfish package makes it easier to download multiple files automatically. The package is only available on GitHub, and needs to be installed using the devtools package. Here is sample code to read all of the xlsx files on Tasmania’s rental market.
library(readxl)
# devtools::install_github("AnthonyEbert/sawfish")
library(sawfish)
url <- "http://data.gov.au/data/dataset/rental-bond-and-rental-data-tasmania-2016-to-2017"
fls <- find_files(url, "xlsx")
# Check that one file can be read nicely
f1 <- tempfile()
download.file(fls[1], f1, mode="wb")
t1 <- read_xlsx(path=f1, sheet=1)
# Now read all files
rentals <- NULL
for (i in 1:length(fls)) {
download.file(fls[i], f1, mode="wb")
t1 <- read_xlsx(path=f1, sheet=1)
rentals <- bind_rows(rentals, t1)
}
datatable(rentals %>% top_n(10))
3.4 JSON
With the advent of web communication comes JavaScript Object Notation (JSON). It is a language-independent data format, and supplants extensible markup language (XML). It is a verbose data descriptor. Here’s an example from Wikipedia describing a person:
{
"firstName": "John",
"lastName": "Smith",
"isAlive": true,
"age": 25,
"address": {
"streetAddress": "21 2nd Street",
"city": "New York",
"state": "NY",
"postalCode": "10021-3100"
},
"phoneNumbers": [
{
"type": "home",
"number": "212 555-1234"
},
{
"type": "office",
"number": "646 555-4567"
},
{
"type": "mobile",
"number": "123 456-7890"
}
],
"children": [],
"spouse": null
}
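In R, the jsonlite package maps JSON like this onto familiar objects; by default, arrays of records are simplified into data frames. A minimal sketch using a trimmed version of the record above:

```r
library(jsonlite)
person <- fromJSON('{
  "firstName": "John",
  "isAlive": true,
  "age": 25,
  "phoneNumbers": [
    {"type": "home",   "number": "212 555-1234"},
    {"type": "office", "number": "646 555-4567"}
  ]
}')
person$age           # a number
person$phoneNumbers  # a data frame with columns type and number
```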
3.4.1 Example: Crossrates
An example we have seen is the cross rates data available at https://openexchangerates.org/. To access this data you need to:
- Get a free plan from https://openexchangerates.org/signup/free
- Provide your API key to the code below by setting an environment variable:
Sys.setenv("OER_KEY" = "your-key-here")
This code can be used to get the latest 10 days of crossrates.
library(lubridate)
library(jsonlite)
library(glue)
# Function to read in rates for one day, given a date and an API key
getDay <- function(day, key) {
u <- sprintf(
"https://openexchangerates.org/api/historical/%s.json?app_id=%s",
day, key
)
d <- jsonlite::fromJSON(u)
# convert timestamp to date object
# http://stackoverflow.com/questions/13456241/convert-unix-epoch-to-date-object-in-r
rates <- as.data.frame(d$rates)
rates$date <- as.POSIXct(d$timestamp, origin = "1970-01-01")
data.frame(rates)
}
# Read in the last 10 days of rates using purrr
days <- seq(Sys.Date() - 9, Sys.Date(), by = "day")
rates10 <- days %>%
purrr::map_dfr(~getDay(day = ., key = Sys.getenv("OER_KEY")))
saveRDS(rates10, file=here::here("data","rates.rds"))
3.5 Shapefiles
The basic building block of a map is a set of points and an ordering variable which can be used to connect the dots with lines, so that a geographic region can be interpreted. Maps are important because they summarise our physical world, and provide the context for data collected for many different purposes – political, health, environment, commerce.
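As a toy illustration of this points-plus-ordering idea (the data below is invented), two triangular “regions” can be drawn with geom_polygon, where rows are in drawing order and the group aesthetic keeps the regions separate:

```r
library(tidyverse)
# Each region is a set of vertices; rows are in drawing order,
# and group identifies which region each vertex belongs to.
toy_map <- tibble(
  long  = c(0, 1, 0.5, 2, 3, 2.5),
  lat   = c(0, 0, 1, 0, 0, 1),
  group = rep(c("A", "B"), each = 3)
)
ggplot(toy_map, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "grey80", colour = "black")
```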
Map information is often recorded and shared in specialised formats, all of which efficiently keep the structural aspects of spatial data organised. The sf (FIXME:REF) package provides tools to access this data.
Here is an example related to bush fires in Australia. In Victoria, there are five Country Fire Authority (CFA) regions. These are organisational regions for fire management. The shapefiles associated with these regions can be downloaded from http://services.land.vic.gov.au/SpatialDatamart/, but these are also provided with this repo. The format is a folder of several binary files, which collectively describe the map geometry. A good explanation of shapefiles can be found here.
library(sf)
library(tidyverse)
boundaries <- st_read("data/vic_cfa/cfa_region.shp")
Reading layer `cfa_region' from data source
`/Users/cookd/data-technologies/data/vic_cfa/cfa_region.shp'
using driver `ESRI Shapefile'
Simple feature collection with 5 features and 3 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 140.9619 ymin: -39.13658 xmax: 149.9763 ymax: -33.98135
Geodetic CRS: GDA94
Shapefiles are typically very detailed, and large in size. For data analysis, they are usually too detailed. It’s a good idea to reduce the size by removing some of the boundary points. This is a tricky task, because some boundary points are more important than others for defining the boundary, and these need to be kept. The rmapshaper package has a thinning function that works nicely to create a small and relatively accurate shapefile. Note that the geom_sf function from the ggplot2 package can be used to directly plot the shapefile. An alternative, albeit a lot more work, is to convert the object into a tibble of points, group ids, and an ordering variable, and plot using geom_polygon.
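For the geom_polygon route, the vertex coordinates can be pulled out of an sf object with st_coordinates, whose L1/L2 (and, for multipolygons, L3) columns identify the piece each point belongs to; rows come out already in drawing order. A self-contained sketch on a toy square (the names sq and pts are illustrative):

```r
library(sf)
library(tidyverse)
# A tiny stand-in for a shapefile: one square region as an sf object
sq <- st_sf(region = "A",
            geometry = st_sfc(st_polygon(list(
              rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0))))))
# Flatten to a tibble of points with a group id per polygon piece
pts <- st_coordinates(sq) %>%
  as_tibble() %>%
  mutate(group = interaction(L1, L2))
ggplot(pts, aes(x = X, y = Y, group = group)) +
  geom_polygon(fill = "grey90", colour = "black")
```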
library(rmapshaper)
bound_sm <- ms_simplify(boundaries, keep_shapes = TRUE)
ggplot(data=bound_sm) + geom_sf(aes(fill=CFA_REGION))
Shapefiles can also be used to contain point information, for example the locations of fire stations. Locations of fire stations in Victoria can be downloaded from the same Victorian Government site, and overlaid on the polygon data.
fire_stations <- st_read("data/vic_fire_stations/geomark_point.shp") %>%
filter(FEATSUBTYP =="fire station", STATE=="VIC")
Reading layer `geomark_point' from data source
`/Users/cookd/data-technologies/data/vic_fire_stations/geomark_point.shp'
using driver `ESRI Shapefile'
Simple feature collection with 52658 features and 23 fields
Geometry type: MULTIPOINT
Dimension: XY
Bounding box: xmin: 140.502 ymin: -39.13655 xmax: 150.0495 ymax: -33.01086
Geodetic CRS: GDA94
ggplot(data=bound_sm) + geom_sf() +
geom_sf(data=fire_stations, colour="red", shape=3)
3.7 Images as data
An example is http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/
3.10 Scraping Data
There are a lot of different packages available in R that allow us to scrape data.
Remember that APIs and data services provide a service to us. Always scrape responsibly and don’t violate Terms of Service!
3.10.1 Examples
3.10.1.1 Weather in Ames
Problem
What does today’s temperature tell us about tomorrow’s?
This question is inspired by Rob Hyndman’s answer to it for temperatures in Melbourne as detailed in (Hyndman 1996) and (Rob J. Hyndman 1996).
Data source
The website Weather Underground https://www.wunderground.com/ provides current and historic weather information for weather stations all across the globe. For historic data, each day is summarised by a set of statistics, such as temperature, humidity, pressure, and wind speeds and compared to averages.
The process
We want to scrape the daily information from Weather Underground for a particular station and some time frame.
First, we write a scraper:
# functions
getWeather <- function(station = "KAMW", year = 2015, month = 1, day = 1) {
numbers <- "[-]*[0-9\\.]*[e0-9]*"
units <- "[°a-zA-Z ]+"
require(stringr)
require(rvest)
url <-
sprintf(
"https://www.wunderground.com/history/airport/%s/%s/%s/%s/DailyHistory.html", station, year, month, day
)
html <- read_html(url)
tabs <- html %>% html_nodes("table")
day.stats <- html_table(tabs[[1]], fill = TRUE)
if (is.null(day.stats$Actual)) return(NULL) # no data available
day.stats$Unit <- str_extract(day.stats$Actual, pattern = units)
day.stats$Actual <- as.numeric(str_extract(day.stats$Actual, pattern = numbers))
day.stats$Average <- as.numeric(str_extract(day.stats$Average, pattern = numbers))
day.stats$RecordYear <-
str_extract(day.stats$Record, pattern = "\\([0-9]{4}\\)")
day.stats$RecordYear <- gsub("[\\(\\)]","", day.stats$RecordYear)
day.stats$Record <-
as.numeric(str_extract(day.stats$Record, pattern = numbers))
day.stats$Station <- station
day.stats$Date <- sprintf("%s/%s/%s",year,month,day)
names(day.stats)[1] <- "Statistics"
day.stats
}
And now we scrape:
######################
library(lubridate)
# download data for
last <- as.Date("2016-03-01")
first <- as.Date("2008-02-21")
days <- seq(first, last, "day")
filename <- "data/KAMW.csv"
station <- "KAMW"
for (dateIDX in length(days):1) {
# browser()
date <- days[dateIDX]
res <- getWeather(station, year(date), month(date), mday(date))
write.table(res, file=filename, append=file.exists(filename),
col.names=!file.exists(filename),
row.names=F, sep=",")
}
And now it’s time for the analysis. Create a vector of the previous day’s temperatures and plot it in a scatterplot. Alternatively, a two-dimensional histogram (using hex binning) also works well in this example, because the focus here is on the high density areas:
weather <- read.csv("data/KAMW.csv", stringsAsFactors = FALSE)
weather$Actual <- as.numeric(weather$Actual)
weather$Average <- as.numeric(weather$Average)
weather$Record <- as.numeric(weather$Record)
temps <- subset(weather, Statistics=="Max Temperature")
temps$Previous <- c(temps$Actual[-1], NA)
library(ggplot2)
ggplot(data = temps, aes(x = Previous, y = Actual)) +
geom_jitter(alpha = I(0.5)) +
geom_smooth() +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
ggplot(data = temps, aes(x = Previous, y = Actual)) +
geom_hex() +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
What we see is a strong linear pattern, i.e. tomorrow’s maximum temperature is close to today’s, making today’s temperature a very strong predictor. There is a peak in density at temperatures around 30 degrees Celsius (shown in the binned scatterplot by the light blue tiles). At higher temperatures we see less variability, while at lower temperatures the relationship is more variable.
3.10.1.2 Your turn
The pattern discussed involves maximum temperatures. How would you expect this pattern to change for averages? Explain your expectation first, then check.
Reformat the weather data set in such a way that you can easily compare maximum and minimum temperatures in a scatterplot.
How do temperatures behave over the course of a year? Organize the data correspondingly and plot.
How does the temperature pattern compare to another place in the world, say e.g. Melbourne, Australia?
Solution for last question
Scrape the data
last <- as.Date("2016-03-01")
first <- as.Date("1995-01-01")
days <- seq(first, last, "day")
filename <- "data/YMML.csv"
station <- "YMML"
for (dateIDX in length(days):1) {
# browser()
date <- days[dateIDX]
res <- getWeather(station, year(date), month(date), mday(date))
write.table(res, file=filename, append=file.exists(filename),
col.names=!file.exists(filename),
row.names=F, sep=",")
}
and plot the data:
#weather <- read.csv("data/YMML.csv", stringsAsFactors = FALSE)
weather <- read_csv("data/YMML.csv")
temps <- subset(weather, Statistics=="Max Temperature")
temps$Previous <- c(temps$Actual[-1], NA)
ggplot(data=temps) +
geom_hex(aes(x=Previous, y=Actual)) + #, fill=log(..value..))) +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
In the paper by (Rob J. Hyndman 1996), conditional densities are used: given today’s temperature, what is the density distribution of tomorrow’s temperatures? We can use high-density region (HDR) boxplots to show the results. The location of the dot indicates the location of the density’s highest mode.
# bin yesterday's temperatures and compute densities of today's temperatures for each level
library(lubridate)
#temps$Month <- month(temps$Date)
#subtemps <- subset(temps, Month %in% c(12,1,2))
subtemps <- na.omit(temps[,c("Previous", "Actual")])
library(hdrcde)
temps.cde <- with(subtemps, cde(Previous, Actual, nxmargin=40))
plot(temps.cde, xlab="Yesterday's maximum temperature", ylab="Today's maximum temperature", plot.fn="hdr", prob=c(50, 95))
3.10.1.3 Weather in Ames (API Version)
In many cases, you are not the first person attempting to use an online dataset for some sort of analysis. Some sites maintain an API (Application Programming Interface) which allows data access via simplified commands. In situations where an API is available (and free), it is generally better to utilize the API than to write a custom scraping function, as typically APIs are more stable than web page structure over time.
In some cases, there are even simplified interfaces to the site’s API: for weather data, the rnoaa package provides R bindings to several APIs maintained by the National Oceanic and Atmospheric Administration.
if (!"rnoaa" %in% installed.packages()) {
install.packages("rnoaa")
}
library(rnoaa)
First, we use the meteo_nearby_stations function to identify weather stations near Ames, IA:
lat_lon_df <- data.frame(id = "ames",
latitude = 42.034722,
longitude = -93.62)
# Get weather station closest to Ames
# Not all stations have all variables, so make sure TMAX is included
nearby_stations <- meteo_nearby_stations(
lat_lon_df = lat_lon_df, radius = 10,
limit = 3, var = "TMAX")
The meteo_pull_monitors function pulls data from the specified variables for the specified date range, for each station id provided.
## library(dplyr)
# According to https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/readme.txt,
# data are provided in tenths of a degree Celsius
max_temperature <- meteo_pull_monitors(
nearby_stations$ames$id,
var = "TMAX",
date_min = "2008-01-01",
date_max = format.Date(Sys.Date(), "%Y-%m-%d")) %>%
mutate(tmax = tmax/10) # Convert to whole degrees Celsius
Formatting this data for plotting is relatively straightforward:
max_temperature <- max_temperature %>%
mutate(yday_tmax = lag(tmax, 1))
library(ggplot2)
ggplot(data = max_temperature, aes(x = yday_tmax, y = tmax)) +
geom_jitter(alpha = I(0.5)) +
geom_smooth() +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
ggplot(data = max_temperature, aes(x = yday_tmax, y = tmax)) +
geom_hex() +
xlab("Yesterday's maximum temperature") +
ylab("Today's maximum temperature") +
coord_equal()
Using the API makes for much simpler code!
3.10.1.4 Your turn
Use the API to get minimum and average daily temperatures in addition to maximum temperatures.
Reformat the weather data set in such a way that you can easily compare maximum and minimum temperatures in a scatterplot.
How do temperatures behave over the course of a year? Organize the data correspondingly and plot.